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Abstract 

We illustrate the use of tools (asymptotic theories of standard er¬ 
ror quantification using appropriate statistical models, bootstrapping, 
model comparison techniques) in addition to sensitivity that may be 
employed to determine the information content in data sets. We do 
this in the context of recent models [23| for nucleated polymerization 
in proteins, about which very little is known regarding the underlying 
mechanisms; thus the methodology we develop here may be of great 
help to experimentalists. 
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1 Introduction 


As mathematical models become more complex with multiple states and 
many parameters to be estimated using experimental data, there is a need 
for critical analysis in model validation related to the reliability of parameter 
estimates obtained in model fitting. A recent concrete example involves pre¬ 
vious HIV models [US] with 15 or more parameters to be estimated. In [I], 
using recently developed parameter selectivity tools |5] based on parameter 
sensitivity based scores, it was shown that many parameters could not be es¬ 
timated with any degree of reliability. Moreover, we found that quantifiable 
uncertainty varies among patients depending upon the number of treatment 
interruptions (perturbations of therapy). This leads to a fundamental ques¬ 
tion: how much information with respect to model validation can be expected 
in a given data set or collection of data sets? 

Here we illustrate the use of other tools (asymptotic theories of standard 
error quantification using appropriate statistical models, bootstrapping, and 
model comparison techniques) in addition to sensitivity theory that may be 
used to determine the information content in data sets. We do this in the 
context of recent models [23] for nucleated polymerization in proteins. 

After presenting the biological context of amyloid formation, we describe 
the model in Section[21 In Sectional we investigate the statistical model to be 
used with our noisy data. This is a necessary step in order to use the correct 
error model in our generalized least squares (GLS) minimization. This also 
reveals information on our experimental observation process. Once we have 
found parameters which allow a reasonable fit, we determine the confidence 
we may have in our estimation procedures. We do this in Section [H using 
both the condition number of the covariance matrix and a sensitivity analysis. 
This reveals a smaller number of parameters (than those estimated in [23] ) 
which appear as reasonably sensitive to the data sets, whereas other do not 
really affect the quality of the fits to our data. To further support our 
sensitivity findings, we then apply a bootstrapping analysis in Section [5] We 
are lead to four main parameters and compare their resulting errors with the 
asymptotic confidence intervals of Section [U Finally, in Section [HI we carry 
out model comparison tests |8| 9] lID] as used in [3], and these lead us to 
select three well-defined parameters that can be reliability estimated out of 
the nine original ones estimated in [23] . 
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1.1 Protein Polymerization 

It is now known that several neuro-degenerative disorders, including Alzheimers 
disease, Huntingtons disease and Prion diseases e.g., mad cow, are related 
to aggregations of proteins presenting an abnormal folding. These protein 
aggregates are called amyloids and have become a focus of modeling efforts 
in recent years One of the main challenges in this field is 

to understand the key aggregation mechanisms, both qualitatively and quan¬ 
titatively. In order to test our methodology on a relatively simple case, we 
focus here on polyglutamine (PolyQ) containing proteins. This was also the 
case study chosen to illustrate the fairly general ODE-PDE model proposed 
in [23]; the reason for our choice is that, as shown in [23], the polymerization 
mechanisms prove to be simpler for PolyQ aggregation than for other types 
of proteins, e.g. PrP [21]. To understand data sets from experiments carried 
by Human Rezaei and his team at INRA, (Virologie et Immunologie Molec- 
ulaires), see [23], we adapt the general model to this context. The data sets 
(DS1-DS4) of interest to us here are depicted in Figure [T] below. 


Adimensionalized total polymerized mass for c 0 =200p mol 



Figure 1: The data sets of interest from [23] E]. 

In [23] and a subsequent effort in [7], the authors sought to investigate 
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several questions including (i) understanding the key polymerization mech¬ 
anisms, (ii) how to select parameters and calibrate the model, and (iii) how 
to numerically approximate the model. Here we briefly summarize results 
related to (iii) and focus primarily on (ii). 

2 The Model 

2.1 Original ODE Model 

This model we used is the same as that of [23] , We briefly outline that model. 
Let {V. V*, Ci ) be the concentrations of the normal monomeric proteins that 
we will call monomers, of the monomeric proteins presenting an abnormal 
configuration that we will call conformers, and of the z-polymers made of 
i aggregated abnormal proteins, respectively. The following comprise the 
fundamental dynamics modeled in [ 23j : 

k+ 

• Monomer-conformer exchange: V ^ V* 

k 7 

k N 

• Nucleation: V* + V* + ... + V* ^ c io 

' --- 

*o off 

• Polymerization by conformer addition: c t + V* ^ c l+ \ 


Other reactions like fragmentation and coalescence are negligible for the case 
of polyglutamine containing proteins (see [23] for experimental justification). 


The law of mass action in the deterministic framework (see mm and 

k+ 

the numerous references therein), translates A+B A'+B 1 into the ordinary 

k i 

differential equation ^ = —k + [A][B] + k~[A]'[B]'. 
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Using these basic ideas we obtain the infinite system of ordinary differ¬ 
ential equations (ODEs) studied in 

— ~ -kfV + kjV*, 


dt 

dV* 

dt 


= k+V - kjV + i„k",c ia -V’J2 K„Ci, 


i>io 


dc. 


dt 

dcj 

dt 


l _0_ _ 1 ,N (y*\io _ U N r — h l ° r V* 

~ fx ‘on\ V ) / V/7 C *0 fv on C iO V J 

= V*(K~nCi-l - K n Ci), i = i 0 + 1, .... 


with initial conditions 

V (0) = c 0 , V*(0) = 0, Ci o (0) = q( 0) = 0 
and the mass balance equation 


d 

dt 


v+v*+j2 ic A =°- 


l=l 0 


( 1 ) 

( 2 ) 

(3) 

(4) 


The experiments of interest to us measure the total polymerized mass, 


i.e., 


M(t) = 


i>io 


2.2 An Approximate PDE System and the Associated 
Forward Problem 

Since very long polymers (a fibril may contain up to 10 6 monomer units) 
characterize amyloid formations, a PDE version of the standard model, where 
a continuous variable x approximates the discrete sizes i, is a reasonable 
approximation for large amyloid polymers. However, for small polymer sizes 
this curarization does not work very well. Thus we take a ’’hybrid approach” 
of leaving the ODE for smaller sizes and use the PDE for larger ones, see plj. 

We define a small parameter e — A, and let Xi = ie with ^ 1 be the 
average polymer size defined by 

E lc i 

_ i>i 0 

~ -■ 

Ed 
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Then after definition of dimensionless quantities 

c £ {t,x) = ^2 c it [xuXi+l] 

we may obtain a partial differential equation (PDE) to replace the infinite 
ODE system. Rigorous derivations of such continuous integro-PDE models 
may be found in [20] for coagulation-fragmentation equations, in [IT] for the 
limit of the Becker-Doring system toward Lifshitz-Slyozov model, and in [18] 
for the growth-fragmentation ’’Prion Model”. A formal derivation for a full 
model, also including nucleation, is carried out in [ 23] . 

Let N 0 G N. We then use the approximation 


dV 

dt 

= -kfV + kjV*, 



dV* 

dt 

= kjV ~ kjV* + iok„ ff 

Do ~ V *Y1 k% on C ii 

(5) 


i>i 0 


© , 

HO 

= k^ n (v*y° - k? ffCl0 - 

po r . y* 

^on^o v i 

(6) 

dci 

dt 

= V*(K~n c i -1 - K n a ), 

i < N 0 , 

(7) 

d t c £ (x,t)- 

= - V*d x (k on c £ (x,t )), 

x > N 0 , 

(8) 


with initial conditions 


R(0) = c 0 , l/*(0) = 0, Cj o (0) = Cj(0) = 0, c £ (x, 0) = 0, 


and the boundary condition 


c e (x = N 0 ,t) = c No (t). 

Then an assumed mass balance equation becomes 


d 

dt 



= 0 . 


In [Tj we considered requirements for a good discretization scheme includ¬ 
ing (i) it should conserve the total polymerized mass, (ii) it should be fast 
and most importantly, (iii) it should be accurate. 
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To ensure the mass conservation, we replace the ODE for V* by the mass 
conservation equation and obtain 




d t c £ (x,t) = -V*d x (k on c £ (x,t )), x > N 0 , 


with initial and boundary conditions as before. 

We developed methodology for forward solutions in [7]. In considering 


these forward solutions we first observed that the desired spatial computa¬ 


tional domain is very large as determined by the maximum size of observed 
polymers, with range up to 10 6 and the peak in the distribution is at the 
left side of the domain of interest; for larger polymer sizes, the distribution 
is almost linearly decreasing. 

Based on these and other considerations discussed in [7], the PDE was 
approximated by the Finite Volume Method (see [21] for discussions of Up¬ 
wind, Lax-Wendroff and flux limiter methods) with an adaptive mesh, refined 
toward the smaller polymer sizes. Furthermore, we kept the ratio between 
the step size and the corresponding mesh element constant, i.e., we used 
= q < 1 so that x t = This mesh is quasi-linear in the sense 

of = 1 + O(q). The resulting Upwind and Lax-Wendroff schemes are 

then consistent on the progressive mesh (see m)- For further details on 
these schemes including examples demonstrating convergence properties, the 
interested reader may consult [7], 

3 The Inverse Problem 

A major question in formulating the model for use in inverse problem sce¬ 
narios consists of how to best parametrically represent the function k on for 
our application? Following |23], we chose to approximate k on by a function 
as depicted in Figure [2] (According to our discussions between S. Prigent, H. 
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Rezaei and J. Torrent, other choices like a Gaussian bell curve are also pos¬ 
sible, and we discuss this later). Thus with this parametrization we have 5 
more parameters kff n , kf!fi x , %i, £ 2 , imax in addition to the 4 basic parameters 
kf ,kj , kff, kgfj to be estimated using our data sets. 



Figure 2: Parametric representation for k on . 

Thus we seek to estimate (with acceptable quantification of uncertainties ) 
the nine parameters kf , kj , k^f , k^ n , and k on (represented in parametrical 
form depicted above with the 5 additional unknowns k™ n , kfffi x , x\, x 2 , imax) 
that fit the data best! To do this we need an efficient discretization method 
as discussed above for the forward problem as well as a correct assumption 
on the measurement errors in the inverse problem. 

3.1 Estimation of Parameters 

We make some standard statistical assumptions (see |9( H0 S HE (26]) underly¬ 
ing our inverse problem formulations. 

• Assume that there exists a true or nominal set of parameter 9 0 = 

{kj , ..., imax) 

• Let Si be iid with E (Si) = 0 and co v(Si, Sf) = a 2 . Let e* G Si. 
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Denote the estimated parameter for 9 0 as 9. The inverse problem is based 
on statistical assumptions on the observation error in the data. 


If we assume an absolute error data model then data points are taken with 
equal importance. This is represented by observations 

Hi = M(ti,6 0 ) + Cj. (9) 

On the other hand, if one assumes some type of relative error data model 
then the error is proportional in some sense to the measured polymerized 
mass. This can be represented by observations of the form 

Hi = M(ti,9 0 ) + M(t i , 6 » 0 ) 7 e i , ye (0,1]. (10) 

Absolute model error formulations dictate we use Ordinary Least Squares 
(OLS) inverse problem [9f [10] given by 


9 = argmin ^(yi - M(t h 9)f 


( 11 ) 


while for relative error model one should use inverse problem formulations 
with Generalized Least Squares (GLS) cost functional 

2 


9 = arg min 


y.i - M(tj, 9) 

M{ti,9)i 


7 6 ( 0 , 1 ]- 


( 12 ) 


3.1.1 The Residual Plots 


To obtain a correct statistical model, we used residual plots (see [9] HO] for 
more details) with residuals given by 


Vi - 

M{U,d¥ 


76 [ 0 , 1 ] 


To illustrate what we are seeking for our data sets, we first used simulated 
relative error data (simulated data for 7 = 1 ), then carried out the inverse 
problems for both a relative error cost functional (i.e., 7 = 1 ) and an ordinary 
least squares cost functional (i.e., 7 = 0). We then plotted the corresponding 
residuals vs time and also residuals vs the model values. The first plots are 
related to the correctness of our assumption of independency and identical 
distributions i.i.d. for the data whereas the second plots contain information 
as to the correctness of the form of our proposed statistical model. 
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0.03 


Residuals using GLS cost function, noise level: 0.01 


Residuals using OLS cost function, noise level: 0.01 
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(a) (b) 

Figure 3: Plots with simulated data: (a) Correct cost function vs. 
(7 = 1 ); (b)Incorrect cost function vs. time (7 = 0 ) 


Residuals using GLS cost function, noise level: 0.01 



Model 


Residuals using OLS cost function, noise level: 0.01 



(a) 


(b) 


Figure 4: Plots with simulated data: (a) Correct cost function vs. 
(7 = 1 ); (b)Incorrect cost function vs. model (7 = 0 ) 


time 


model 
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3.2 Statistical Models of Noise 


We next carried out similar inverse problems with data set (DS) 4 of our 
experimental data collection. We first used DS 4 on the interval t e [0,8]. 
Based on some earlier calculations we also chose the nucleation index = 2 
for all our subsequent calculations. The residual plots given below in Figures 
H 3 and IH] suggest strongly that neither of the first attempts of assumed statis¬ 
tical models and corresponding cost functionals (absolute error and OLS or 
relative error with 7 = 1 and simple GLS) are correct. 



Figure 5: (a) M(f fc ) with OLS; (b) Residuals vs Model: OLS 

Based on these initial results and the speculation that early periods of 
the polymerization process may be somewhat stochastic in nature, we chose 
to subsequently use all the data sets on the intervals [to, 8] where to is the 
first time when M(to) > 0.12 (thus 12 % of the total polymerized mass). 
Moreover, we decided to use other values of 7 between 0 and 1 to test data 
set 4. 

We thus carried out further investigations with inverse problems for data 
points M(tfe) > 0.12 and io = 2 where we focused on the question of the most 
appropriate values of 7 to use in a generalized least squares approach (again 
see [9] for further motivation and details). We then obtained the results with 
data set 4 depicted in Figure [7] 

Analysis of these residuals suggest that either 7 = 0.6 or 7 = 0.7 might 
be satisfactory for use in a generalized least squares setting. 
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Madim for GLS and dataset 4 


Residuals vs model for GLS, N0=500 and dataset 4 




Figure 6: (a) M(t fc ) with GLS, 7=1; (b) Residuals vs Model: GLS 
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Figure 7: Residuals for data set 4 using different values of 7. 


Motivated by these results, we next investigated the inverse problems 
for each of the four experimental data sets with initial concentration Cq = 
200/nnol and « 0 = 2. We carried out the optimization over all data points with 
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M(tk) > 0.12 and used the generalized least squares method with 7 = 0 . 6 . 
The resulting graphics depicted in Figure [S] again suggest that 7 = 0.6 is a 
reasonable value to use in our subsequent analysis of the polyglutamine data 
with regard to its information content for inverse problem estimation and 
parameter uncertainty quantification. 
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Figure 8 : Residuals for the 4 experimental data sets using 7 = 0.6. 
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4 Standard Errors and Asymptotic Analysis 


4.1 Standard Errors for Parameters Using GLS 

We employed first the asymptotic theory for parameter uncertainty summa¬ 
rized in iM] and the references therein. In the case of generalized 
least squares, the associated standard errors for the estimated parameters 
9 = (kf,...,imax ) (vector length kq = 9) are given by the following construc¬ 
tion (for details see Chap. 3.2.5 and 3.2.6 of [9]): 

Define the covariance matrix by the formula 

SE k = \Jz kk (9), k = 1,9, 

where 

m = a 2 ( X T (d)W(9) X (6))- 1 . 

Here x is the sensitivity matrix of size n x Kg (n being the number of data 
points and Kg being the number of estimated parameters) and W is defined 
by 

W~\Q) = diag(M(i 1 ; 9) 2j ,..., M(t n , 9)^). 


We use the approximation of the variance 


o" J 


m 2 = 


n 


m( u- eyE 


(■ M(ti,e ) - yif 


To obtain a hnite standard error using asymptotic theory, the 9x9 matrix 
F = x T (9)W(9)x(0) thus must be invertible. In the above problem we do 
indeed obtain a good fit of the curve and good residuals (for the sake of 
brevity, not depicted here!). However, we also found that the condition 
number of the matrix 

F = x T WW0)x(S) 

is k = 10 24 . Looking more closely at the matrix F reveals a near linear 
dependence between certain rows, hence the large condition number. We 
thus quickly reach the following conclusions : 


1. We obtain a set of parameters for which the model fits well, but we 
cannot have any reasonable confidence in them using the asymptotic 
theories from statistics e.g., see the references given above. 
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2. We suspect that it may not be possible to obtain sufficient information 
from our data set curves to estimate all 9 parameters with a high degree 
of confidence! This is based on our calculations with the corresponding 
Fisher matrices as well our prior knowledge in that the graphs depicted 
in Figure [T|are very similar to Logistic or Gompertz curves which can 
be quite well fit with parameterized models with only 2 or 3 carefully 
chosen parameters! 

To assist in initial understanding of these issues, we consider the associated 
sensitivity matrices x — yif- 

4.2 Sensitivity Analysis 

For the sensitivity analysis, we follow II. Hereafter all our analysis will 
be carried using data set 4 and the best estimate 9 obtained for the latter. We 
fold that the model is sensitive mainly to four parameters: kf , kj , k* n . k^j. 
The sensitivities for the remaining parameters are on an order of magnitude 
of 1CT 6 or less. It also shows some sensitivity with respect to x\. However, 
the parameter x\ appears in the model only as factor Xii max - The sensitivities 
depicted below use 9 for the nine best fit GLS parameters , i.e., 9 for Kg = 9. 


Sensitivity for k 


Sensitivity for kj 1- 




(a) (b) 

Figure 9: (a) Sensitivity w.r.t. kj ; (b) Sensitivity w.r.t. 
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Figure 10: (a) Sensitivity w.r.t. k^ n \ (b) Sensitivity w.r.t. k^j 




(a) (b) 

Figure 11: (a) Sensitivity w.r.t. (b) Sensitivity w.r.t. k^jf 
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Sensitivity for x 1 


Sensitivity for x g 



Figure 12: (a) Sensitivity w.r.t. X\ ; (b) Sensitivity w.r.t. X2 




(a) (b) 

Figure 13: (a) Sensitivity w.r.t. i max ; (b) Sensitivity w.r.t. Xu = i ma x x 1 
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5 Sensitivity Motivated Inverse Problems 

Based on the sensitivity findings depicted above, we investigated a series of 
inverse problems in which we attempted to estimate an increasing number of 
parameters beginning first with the fundamental parameters kf and kj. In 
each of these inverse problems we attempted to ascertain uncertainty bounds 
for the estimated parameters using both the asymptotic theory described 
above and a generalized least squares version of bootstrapping [12] [131 13 

El ESI- 

A quick outline of the appropriate bootstrapping algorithm is given next. 


5.1 Bootstrapping Algorithm: Nonconstant Variance 
Data 

We suppose now that we are given experimental data (ti, yi(t n , y n ) 
from the underlying observation process 

Y i ^M(t i -e 0 ) + M(t i] 9 0 ys i , (13) 

where i — 1 ,..., n and the £ t are i.i.d. with mean zero and constant variance 
ctq. Then we see that E(y) = M(tf,6 0 ) and Var(Yj) = <Tq M 2ry (ti,9 0 ), with 
associated corresponding realizations of Y) given by 

Vi — M(ti] 6 q) + M(ti\ 9 0 ) 7 Cj. 

A standard algorithm can be used to compute the corresponding boot¬ 
strapping estimate Oboot of 9q and its empirical distribution. We treat the 
general case for nonlinear dependence of the model output on the parame¬ 
ters 9. The algorithm is given as follows. 

1 . First obtain the estimate 9° from the entire sample {?/*} using the GLS 
given in (fT2|) with 7 = 1 . An estimate 9b 00 t can be solved for iteratively 
as follows. 


2. Define the nonconstant variance standardized residuals 




Vi - 

MtuJy ’ 


i — 1, 2,... ,77.. 


Set m = 0. 
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3. Create a bootstrapping sample of size n using random sampling with 
replacement from the data (realizations) {si,...,s n } to form a boot¬ 
strapping sample {s™,..., s™}. 

4. Create bootstrapping sample points 

y? = M{U,P) + Mfa&ys?, 


where i — 1 ,.. . ,n. 

5. Obtain a new estimate Q m+1 from the bootstrapping sample {y™} using 
GLS. 

6 . Set m — m + 1 and repeat steps 3-5 until m > M where M is large 
(e.g., M=1000). 

We then calculate the mean, standard error, and confidence intervals 
using the formulae 

a _ J_ nm 

Vboot — M Z_^m= 1 ° J 

Var(9 boot ) = ^ - Lot) T (L ~ Lot), (14) 

SE kidboot) yjv dv(9 b oot)kk • 


where 9b oot denotes the bootstrapping estimator. 

5.2 Estimation of two parameters 

We first carried out estimation for the 2 parameters kf and kj. We use 
the GLS formulation with 7 = 0.6. We fix globally (based on previous 
estimations with DS 4) the parameter values 


07 

A 'on 

h of.f 

umin 
™on 

Umax 

"'on 

Xi 

x 2 

^max 

4616.962 

93.332 

1684.381 

1.5152 • 10 9 

0.0626 

0.859 

3.542 • 10 5 


and used the initial guesses for the parameters given by 



ki 

kj 

Qo 

2.1600 

10.9270 
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We then used the bootstrapping algorithm and obtained the following 
means and standard errors for M = 1000 which, as reported below, com¬ 
pare quite well with the asymptotic theory estimates. The corresponding 
distributions are shown in Figures fl4l and fT5l 



kf (boot) (GLS) 

kj (boot) (GLS) 

kf (asym.p) (GLS) 

kj (asymp)(GLS) 

mean 

2.158 

10.911 

2.157 

10.911 

SE 

0.0044 

0.0247 

0.00396 

0.0225 



Figure 14: Two parameters estimation (kf, kj ). Bootstrapping distribution 
for kf. We use GLS and M=1000 runs. 


5.3 GLS Estimation of 3 Parameters 

We tried next to estimate 3 parameters. We again used the GLS formulation 
with 7 = 0.6. Once again we fixed all the parameters describing the domain 
and the polymerization function k on and we also fixed either or kf n in 
the corresponding inverse problems. 

5.4 GLS Estimation for kf, kf and kff n 

We fixed values as follows: 


K oft 

j Lmin 
™on 

Umax 
^on 

X\ 

X2 

^max 

93.33 

1684.38 

1.5 • 10 y 

0.062 

0.859 

3.5- 10 5 
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Figure 15: Two parameters estimation ( kf , k T ). Bootstrapping distribution 
for kj. We use GLS and M=1000 runs. 

We used as initial parameter values: 



kf 

k : 

Un 

^on 

Qo 

2.1600 

10.9270 

4616.962 


We obtained the estimated parameters together with the corresponding stan¬ 
dard errors, variances and the condition numbers k of the corresponding 
sensitivity matrices for the four data sets as reported below. The 95% con¬ 
fidence results based on the asymptotic theory are also depicted for DS 4 in 
Figure [16] 



kf 

kj 

K on 

SE 

a 

2 

K 

DS1 

2.26 

13.49 

4616.96 

(.012, 

.099,53.925) 

8.52- 

10“ 6 

8.89 • 

O 

o 

DS2 

2.99 

16.20 

4616.96 

(.021, 

.151,56.691) 

9.67- 

10“ 6 

4.37- 

O 

o 

DS3 

2.18 

15.76 

9840.31 

(.011, 

.103,90.466) 

6.45- 

10“ 6 

3.94- 

10 11 

DS4 

2.16 

10.91 

4616.96 

(0.0089, 

0.0649,45.262) 

6.36- 

10“ b 

7.14 • 

o 

o 


To compare these asymptotic results with bootstrapping, we carried out 
bootstrapping with Data Set (DS) 4 for the estimation of kf, kj and kff with 
the same initial values as above. We then obtained the following means and 
standard errors for a run with M = 1000, in comparison to the asymptotic 
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Figure 16: Confidence Intervals 


theory. 



kf (boot) 

k r (boot) 

kKboot) 

kf (asymp) 

k r (asymp) 

kf n (asymp) 

mean 

2.153 

10.887 

4616.962 

2.157 

10.910 

4616.962 

SE 

0.0039 

0.0219 

0.00003 

0.0089 

0.0649 

45.262 


Of particular interest are the values obtained for kf n and the bootstrap¬ 
ping standard errors for kf n which are extremely small. It should be noted 
that the sensitivity of the model output on kf n is also very small. Thus one 
might conjecture that the iterations in the bootstrapping algorithm do not 
change the values of kf n very much and hence one observes the extremely 
small SE that are produced for the bootstrapping estimates. 
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Figure 17: Estimation for kf : kj and k^ n : Bootstrapping distribution for kj 
for GLS and 1000 runs. 



Figure 18: Estimation for kt, k T and ktL: Bootstrapping distribution for kt 
for GLS and 1000 runs. 
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Figure 19: Estimation for kf, kj and k^ n : Bootstrapping distribution for k° n 
for GLS and 1000 runs. 
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5.5 GLS estimation for kj ■> K : and k^j 

In another test, we fixed k^ n and instead estimate k^j (along with k~}~ and 
kj). We use the fixed values: 


pv 

™ on 

h.min 
^on 

L,max 
^on 

X\ 

X2 

^max 

4616.962 

1684.381 

1.5152 • 10 9 

0.0626 

0.859 

3.542 • 10 5 


and the initial guesses for the parameters to be estimated given by: 



kt 

ki 

pv 

K o.f.f 

% 

2.1600 

10.9270 

108.256 


We obtained the estimated parameters and corresponding SE. 



kf 

kj 

pv 

K off 

SE 

a 2 

K 

DS1 

2.203 

12.997 

99.861 

(.011, .091,1.208) 

8.165 • 10~ 6 

4.912 ■ 10 7 

DS2 

2.893 

15.474 

100.019 

(.019, .137,1.279) 

9.323 • 10~ 6 

2.486 • 10 7 

DS3 

2.168 

15.631 

41.935 

(.011, .102, 0.424) 

6.435 • 10~ 6 

9.125 • 10 6 

DSA 

2.181 

11.090 

90.536 

(.009, .066, 0.936) 

6.289 • 10~ 6 

3.043 • 10 7 


Also in this case, we carried out bootstrapping for DS 4. The bootstrap¬ 
ping distributions for A;/, kj and k^j are found in Figures [2011221 We then 
obtained the following means and standard errors for a run with M = 1000 
in comparison to the asymptotic theory. 



k~j (boot) 

kj (boot) 

koff (boot) 

A’/ (asymp) 

k r (asymp) 

k^f (asymp) 

mean 

2.169 

11.013 

91.254 

2.181 

11.090 

90.536 

SE 

0.0094 

0.0699 

1.0392 

0.009 

0.066 

0.936 
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Figure 21: Three parameters estimation (A;/, kj and k^j): 
distribution for kj. We used GLS and M=1000 runs. 


Bootstrapping 


Bootstrapping 
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Figure 22: Three parameters estimation (kf, kj and 
distribution for fc^yy. We used GLS and M=1000 runs. 


Bootstrapping 
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5.6 Estimation of 4 main parameters 

Following the sensitivity analysis detailed above, we tried to estimate a com¬ 
bination of the parameters kf, kj , k° n , k^f f for the parameter set with Kg = 4. 
Parameters as follows were fixed from the original 9 parameter fit: 


l»min 
^on 

h,max 
^on 

Xi 

x 2 

^ max 

1684 

1.5- 10 9 

0.062 

0.859 

3.5- 10 5 


We obtained the following result for the estimation of the four parameters 
using the data sets 1 to 4. In all of them, the condition number of the 
Fischer’s information matrix k is too large to invert. This along with the 
sensitivity results above strongly suggests that the data sets do not contain 
sufficient information to estimate 4 or more parameters with any degree of 
certainty attached to the estimates. 



kt 

ki 

™ on 

pv 

K off 

a 2 

K 

DS1 

2.1431 

12.4751 

4616.962 

108.259 

8.7219 • 10~ 6 

6.1226 • 10 19 

DS2 

2.7995 

14.7630 

4616.957 

108.4308 

9.8694 • 10~ e 

1.4442 • 10 19 

DS3 

2.180 

15.757 

4618.599 

41.369 

6.4622 • lO” 9 

1.881 • 10 17 

DS4 

2.161 

10.9278 

4617.3316 

93.3265 

6.374- IQ’ 6 

2.144 • 10 18 


6 Model Comparison Tests 

A type of Residuals Sum of Squares (RSS) based model selection criterion 
[El S, m can be used as a tool for model comparison for certain classes of 
models. In particular this is true for models such as those given in [3j in which 
potentially extraneous mechanisms can be eliminated from the model by a 
simple restriction on the underlying parameter space while the form of the 
mathematical model remains unchanged. In other words, this methodology 
can be used to compare two nested mathematical models where the parameter 
set flff (this notation will be defined explicitly in Section I6.il below) for 
the restricted model can be identified as a linearly restricted subset of the 
admissible parameter set fig of the unrestricted model. Indeed, the RSS 
based model selection criterion is a useful tool to determine whether or not 
certain terms in the mathematical models are important in describing the 
given experimental data. 
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6.1 Ordinary Least Squares 

We now turn to the statistical model (l9j) . where the measurement errors are 
assumed to be independent and identically distributed with zero mean and 
constant variance a 2 . In addition, we assume that there exists do such that 
the statistical model 

Yj = M ( tj ; 9 0 ) +£j, j = 1,2,..., n. (15) 

correctly describes the observation process. In other words, (fT51) is the true 
model, and do is the true value of the mathematical model parameter d. 

With our assumption on measurement errors, the mathematical model 
parameter 9 can be estimated by using the ordinary least squares method; 
that is, the ordinary least squares estimator of 9 is obtained by solving 

d n = arg min J n (d' Y). 
een.0 

Here Y = (Y\. Y 2 ,..., Y n ) T , and the cost function J n is defined as 

1 n 

J n (d-,Y) = - J2(Y k - M(t k -d)) 2 . 
n 

k =1 

The corresponding realization d n of d n is obtained by solving 

d n = arg min J n (d] y), 

where y is a realization of Y (that is, y = (yi, y 2 ,..., y n ) T )- 

As alluded to in the introduction, we might also consider a restricted 
version of the mathematical model in which the unknown true parameter 
is assumed to lie in a subset Qff C He of the admissible parameter space. 
We assume this restriction can be written as a linear constraint, Udo = h, 
where T~L € i s a matrix having rank K r (that is, K r is the number of 

constraints imposed), and h is a known vector. Thus the restricted parameter 
space is 

= {d G He : Ud = h} . 

Then the null and alternative hypotheses are 

Hq ■ d 0 G H^ 

H a -. do£ fif. 
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We may define the restricted parameter estimator as 


9 n ’ H = arg min J n (0- Y), 

and the corresponding realization is denoted by 9 n,H . Since Qjf C fig, it is 
clear that 

J n (hy)<J n (d n ’ H ; y). 

This fact forms the basis for a model selection criterion based upon the 
residual sum of squares. Using the standard assumptions (given in detail in 
0), one can establish asymptotic convergence result for the test statistics 
(which is a function of observations and is used to determine whether or not 
the null hypothesis is rejected) 

TTn _ n ( J n (d n ' H - Y) - J n (9 n ; Y)) 

U ~ J n (9 n ; Y) 

where the corresponding realization U n is dehned as 

n (j n (9 n,H ; y) — J n (9 n ;y)] 

U n = — - 77 ];----• ( 16 ) 

J n (9 n ] y) 

This asymptotic convergence result is summarized in the following theorem. 

Theorem 6.1. Under assumptions detailed in jaiizy and assuming the null 
hypothesis H 0 is true, then U n converges in distribution (as n —>■ oo) to a 
random variable U having a chi-square distribution with n r degrees of free¬ 
dom. 

The above theorem suggests that if the sample size n is sufficiently large, 
then U n is approximately chi-square distributed with K r degrees of freedom. 
We use this fact to determine whether or not the null hypothesis H$ is re¬ 
jected. To do that, we choose a significance level a (usually chosen to be 
0.05) and use y 2 tables to obtain the corresponding threshold value r so that 
Prob{U > t) = a. We next compute U n and compare it to r. If U n > r, 
then we reject the null hypothesis Hq with confidence level (1 — a) 100%; 
otherwise, we do not reject. We emphasize that care should be taken in stat¬ 
ing conclusions: we either reject or do not reject H 0 at the specified level of 
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confidence. The table below illustrates the threshold values for y' 2 (l) with 
the given significance level. 


a 

r 

confidence level 

.25 

1.32 

75% 

.1 

2.71 

90% 

.05 

3.84 

95% 

.01 

6.63 

99% 

.001 

10.83 

99.9% 


Similar tables can be found in any elementary statistics text or online or 
calculated by some software package such as Matlab, and is given here for 
illustrative purposes and also for use in the examples demonstrated below. 

6.2 Generalized Least Squares 

The model comparison results outlined can be extended to deal with gener¬ 
alized least squares problems in which measurement errors are independent 
with E(£fc) = 0 and Var(Sk) = a 2 w 2 (tk, 6), k — 1,2,... ,n, where w is some 
known real-valued function with w(t,9) % 0 for any t. This is achieved 
through rescaling the observations in accordance with their variance (as dis¬ 
cussed in [9]) so that the resulting (transformed) observations are identically 
distributed as well as independent. 

6.3 Results for PolyQ Aggregation Models 

We then carried out a series of model comparison tests (we again used DS 4) 
for nested models to determine if an added parameter yields a statistically 
significantly improved model fit. Our null hypothesis in each case was: H 0 : 
The restricted model is adequate (i.e., the fit-to-data is not significantly 
improved with the model containing the additional parameter as a parameter 
to be estimated). We obtained the following results. 

1. Model with estimation of {kf,kj} vs. the model with estimation 
of {k^, kj,k^j} : We find with n=699, J n (0^-;Y) = .0044192109, 

J n (9 n ] Y) = .0043709501 and U n = 7.7178. Thus we reject H 0 at a 
99% confidence level. 
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2. Model with estimation of {kj,k T } vs. the model with estimation of 
{kj,kj,k^} : We find J n (6 n ; Y) = .0044192108 with U n = 7.49 x 
1 q -°6 xhus we don’t reject H 0 at a 99% confidence level. 

3. Model with estimation of {kj, kj , k vs. the model with estimation 
of {kj,kj,kj ff ,kj n } : To the order of computation we find no difference 
in the cost functions in this case and therefore we do not reject Ho at 
a confidence level of 99%. 

4. Model with estimation of {kj , kj , kj n } vs. the model with estimation of 
{kj, kj, k* n , k? ff } : We find J n {9 n ; Y) = .0043709780 with U n = 7.7133 
and hence we reject Ho with a confidence level of 99%. 

From these and the preceding results we conclude the information content 
of the typical data set for the dynamics considered here will support at most 
3 parameters estimated with reasonable confidence levels and these are the 
parameters {kj, kj , kjjj }. 

7 Conclusions and Suggested Further Efforts 

For the efforts reported on above we make several conclusions. 

For the majority of data sets, the GLS residual plots with 7 = 0.6 are 
random when fitted for data points M(tk ) > 0 . 12 . As conjectured earlier, 
this may be because the early formation of aggregates is somewhat stochas¬ 
tic in nature which is not well described by either the mathematical and/or 
statistical models. It appears that one needs special consideration of smaller 
polymer sizes. Indeed we suspect from additional discussions with our col¬ 
leagues that perhaps the nucleation step might be dominated by a stochastic 
rather than deterministic process in the early stages (i.e., for small polymer 
sizes). This is a possible direction of further investigation. 

Based on several different mathematical/statistical methodologies (sen¬ 
sitivities, asymptotic analysis, bootstrapping, model comparison tests), the 
data sets we considered do not contain sufficient information for the reliable 
estimation of all 9 parameters of interest. Indeed our findings suggest that 
at most 3 parameters can be reliably estimated with the data sets typical 
of those presented here, and that these parameters are {kj, kj, kj-j}. Re¬ 
cently related efforts [ 2 ] suggest that perhaps there are experimental design 
questions that could be addressed to collect data that might support the 
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more sophisticated models derived in [23], especially in order to investigate 
information coming from different initial concentrations. Indeed, we have 
considered here data sets related to experiments carried out with the same 
initial concentration. Adapting the previously used techniques to simulta¬ 
neously or successively use all the information content in data sets carried 
out for different initial concentration is a challenging problem (see [22] for a 
discussion of the effect of initial concentration on nucleated polymerization). 

Here we conclude that at most 3 parameters {kf, kj , k^j} can be reliably 
estimated with the data sets investigated. The two first parameters determine 
the balance between the normal and abnormal protein concentrations and 
the third represents the stability of the nucleus against the degradation into 
monomeric entities. These three parameters are related to the early steps 
of the aggregation process, and thus we conclude that the model applied 
to these data sets does not provide any insight into the polymerization of 
larger polymers. Since this is the case, there is little motivation to modify 
the polymerization function depicted in Figure [2] until further data collection 
procedures are pursued. 
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